Link to merged notebook:
https://github.com/TheRensselaerIDEA/COVID-Notebooks/blob/master/MATP-4400-FINAL/AbrahamSanders_FINAL_2020.Rmd
https://github.com/TheRensselaerIDEA/COVID-Notebooks/blob/master/MATP-4400-FINAL/AbrahamSanders_FINAL_2020.html
Code is mostly excluded from the knit .html version of this notebook to maintain a clean presentation. It is included in a few places that make sense for demonstrative purposes. The full code is provided in the accompanying .Rmd file.
To run the .Rmd file, make sure the included dependencies Elasticsearch.R and elasticsearch_queries.R are in the same directory as the .Rmd file, and set “elasticsearch_host” to the appropriate value here (it is omitted from the GitHub version for security reasons).
elasticsearch_host <- ""
Social media provides a rich corpus of text that offers a real-time view of daily happenings and current events within our communities. We happen to be living through what may turn out to be a historically significant event: the COVID-19 pandemic. The pandemic, which originated in the Wuhan area of China in December 2019, began making a significant impact in the United States in mid-March 2020. This project explores the impact that COVID-19 has had on higher education both in the United States and worldwide by studying themes and topics being discussed on Twitter in the midst of that mid-March inflection point, as people were grappling with the new reality of a quarantined society.
A method for gathering topic-focused tweet datasets was originally developed as part of a separate undergraduate research project this term here at RPI. This effort was initially focused on using the Twitter streaming API [2] to gather tweets related to the opioid crisis in an effort to reproduce results in classifying indicators of opioid abuse on Twitter published by researchers at the University of Pennsylvania [1]. The tooling built for that goal was later repurposed to gather tweets related to the COVID-19 pandemic, and the resulting dataset contains over 40 million tweets gathered between March 17th and April 15th 2020. This dataset also includes tweets prior to March 17th that were retweeted during the gathering interval. All of the tweets in this repository were selected by a filter keyword list targeting areas of interest with respect to COVID-19. The keyword list can be found here.
For this exploratory analysis, tweets pertaining to higher education are sampled from the COVID-19 tweet dataset over a three-day period in mid-March. By studying the themes, topics, and sentiments being discussed on Twitter with respect to higher education at this point in time, a sense of the initial impact of COVID-19 on higher education can be established.
A method for sampling tweets from the dataset by relevance to a semantic phrase was also developed as part of the aforementioned undergraduate research project. To get samples of tweets pertaining to higher education, we embed the phrase “colleges and universities” into a high-dimensional vector space and calculate the cosine similarity between that phrase and all the tweets in the dataset for the requested time range. This query is executed on our dataset using Elasticsearch [3].
For example, here are tweets relevant to the phrase “colleges and universities” tweeted between March 1st and March 15th 2020:
elasticsearch_indexname <- "coronavirus-data-all"
results <- do_search(indexname=elasticsearch_indexname,
rangestart="2020-03-01 00:00:00",
rangeend="2020-03-16 00:00:00",
semantic_phrase="colleges and universities",
resultsize=10,
resultfields='"created_at", "user.screen_name", "text", "extended_tweet.full_text"',
elasticsearch_host=elasticsearch_host,
elasticsearch_path="elasticsearch",
elasticsearch_port=443,
elasticsearch_schema="https")
#print results
params.df <- data.frame(from=results$params$rangestart,
to=results$params$rangeend,
phrase=results$params$semantic_phrase,
results.count=paste(nrow(results$df), "/", results$total))
kable(params.df) %>% kable_styling()
| from | to | phrase | results.count |
|---|---|---|---|
| 2020-03-01 | 2020-03-16 | colleges and universities | 10 / 48284 |
display.df <- results$df[, c("cosine_similarity", "full_text", "created_at", "user_screen_name")]
kable(display.df) %>% kable_styling()
| cosine_similarity | full_text | created_at | user_screen_name |
|---|---|---|---|
| 0.4089483 | Universities are a key part of any government’s evidenced-base decision-making, but more unis are assessing the evidence base and arriving at a different conclusion from the UK govt - Universities rapped by government over coronavirus ‘closures’ https://t.co/yEuwloJtQG | Sun Mar 15 09:22:30 +0000 2020 | DrCORourke |
| 0.3773400 | Colleges finally concede: you don’t need to come here anymore. It’s just as good (even better) to take classes online. Most online schools are free. We’ve been raising tuitions 10% a year for 50 years only because you’ve been stupid enough to borrow it and pay us. | Fri Mar 13 14:31:29 +0000 2020 | jaltucher |
| 0.3551096 | American academia, folks. https://t.co/RFOy74nCUw | Thu Mar 12 07:45:48 +0000 2020 | whitesundesert |
| 0.3506271 | My @UWMadison peeps- especially my current and past @CMB_UWMadison students and alum- this is your time! Please RT | Sun Mar 15 11:28:54 +0000 2020 | ScienceByNadia |
| 0.3459380 | Universities and colleges may as well end the charade of switching to online courses; liberate their faculty and students; provide instead a Pass/Fail grade option based on work done so far since it is already mid-semester; and begin classes again next Fall or whenever. | Thu Mar 12 17:49:35 +0000 2020 | Ca_Rule |
| 0.3341664 | Usually recessions are great for universities; people go back to school. But with the amount of fixed costs involved, if Universities can’t open and don’t bring in tuition money for their summer/fall semesters it’s going to be Armageddon in higher education. | Sun Mar 15 18:33:56 +0000 2020 | Austen |
| 0.3331325 | YouTube videos and libraries is that we have a wide range of access to information and technology that can allow us to make colleges and universities very cheap, or yet…free | Sun Mar 08 13:11:22 +0000 2020 | DarkerAlpha |
| 0.3273025 | List of Colleges closing due to Coronavirus: - Not yours lol | Wed Mar 11 05:49:56 +0000 2020 | ashishparmar99 |
| 0.3018154 | All of these colleges and universities closing and sending students home with little notice… are they refunding students their housing & meal plan costs? For students on need scholarships, is that money being given to them in cash so they can afford to live and eat elsewhere? | Tue Mar 10 13:26:29 +0000 2020 | JillFilipovic |
| 0.2967776 | All students in college right now should receive a “COVID-19 Scholarship” next year since we are paying for resources we aren’t even using | Fri Mar 13 15:58:23 +0000 2020 | feltmanjosh |
Tweets with higher cosine similarity scores are more semantically relevant to the search phrase “colleges and universities”. The embedding vectors are provided by Universal Sentence Encoder [4], a transformer [6] neural network architecture pre-trained on a text similarity ranking task. TensorFlow [5] is used to run this embedding model.
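The ranking step can be illustrated with a minimal sketch in base R. The three-dimensional vectors below are made-up stand-ins for the real 512-dimensional embeddings, and `cosine_similarity` is a hypothetical helper written for this illustration; in the actual system the scoring is performed by the Elasticsearch query, not by this code.

```r
# Toy stand-ins for embedding vectors (assumption: the real vectors are
# 512-dimensional and come from Universal Sentence Encoder; these are made up).
phrase_vec <- c(1, 0, 1)
tweet_vecs <- rbind(
  tweet_a = c(2, 0, 2),  # same direction as the phrase
  tweet_b = c(1, 1, 0),  # partially aligned with the phrase
  tweet_c = c(0, 3, 0)   # orthogonal to the phrase
)

# Hypothetical helper: cosine similarity between one vector and every row of a matrix.
cosine_similarity <- function(mat, vec) {
  as.vector(mat %*% vec) / (sqrt(rowSums(mat^2)) * sqrt(sum(vec^2)))
}

scores <- cosine_similarity(tweet_vecs, phrase_vec)
names(scores) <- rownames(tweet_vecs)

# Rank tweets from most to least relevant to the phrase.
ranked <- sort(scores, decreasing = TRUE)
print(round(ranked, 3))
```

Because cosine similarity depends only on direction, `tweet_a` scores highest even though it is longer than the phrase vector.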
To organize a large sample of higher-education related tweets, k-means [7] clustering is used on the embedding vector space to group together semantically related tweets. We call the resulting high-level clusters “themes”. These clusters are labeled using a term-frequency ranking - the top three most frequently used non-stopword terms in each cluster become the respective theme label.
Within each theme cluster, k-means is run again to organize the theme cluster into subclusters which we call “topics”. Topic subclusters are labeled in the same manner as theme clusters, with the restriction that a topic cluster may not contain any terms in its label that are already in the theme cluster label.
For example, if a theme is “health / medical / workers”, these three terms may not be included in any of the topic subclusters within.
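The labeling rule can be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: `label_cluster`, the tiny stopword list, and the toy texts are all hypothetical, and the tokenizer is a simple split on non-alphanumeric characters.

```r
# Hypothetical helper: label a cluster with its top-n most frequent terms,
# skipping stopwords and any terms listed in `exclude`.
label_cluster <- function(texts, stopwords, exclude = character(0), n = 3) {
  tokens <- unlist(strsplit(tolower(texts), "[^a-z0-9]+"))
  tokens <- tokens[nzchar(tokens) & !(tokens %in% c(stopwords, exclude))]
  counts <- sort(table(tokens), decreasing = TRUE)
  names(counts)[seq_len(min(n, length(counts)))]
}

stopwords <- c("and", "for", "the")  # assumption: a real stopword list is much longer

# Toy theme cluster (made-up texts).
theme_texts <- c("student loans and student debt",
                 "cancel student loans",
                 "student loans refund",
                 "loans and debt refund",
                 "student debt")
theme_label <- label_cluster(theme_texts, stopwords)

# Toy topic subcluster: terms already used in the theme label are excluded.
topic_texts <- c("student housing refund",
                 "refund housing and student debt",
                 "housing refund now",
                 "refund loans")
topic_label <- label_cluster(topic_texts, stopwords, exclude = theme_label)

print(paste(theme_label, collapse = " / "))
print(paste(topic_label, collapse = " / "))
```

Passing the theme label as `exclude` is what keeps a topic label from simply repeating the terms of its parent theme.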
To visualize theme clusters and topic labels, t-SNE [8] is used to project points from the original 512-dimensional embedding vector space into two dimensions.
# query start date/time (inclusive)
rangestart <- "2020-03-18 00:00:00"
# query end date/time (exclusive)
rangeend <- "2020-03-21 00:00:00"
# query semantic similarity phrase
semantic_phrase <- "colleges and universities"
# number of results to return (max 10,000)
resultsize <- 10000
To find the optimal number of high-level theme clusters for this sample, an elbow plot is used:
The plot mostly represents a smooth curve, although there is a distinct “elbow” point between k=8 and k=10. We will select k=8:
k <- 8
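For readers unfamiliar with elbow plots, the idea can be sketched with `stats::kmeans` on toy data. The data here are an assumption (three synthetic two-dimensional blobs rather than tweet embeddings), so the elbow is far sharper than the one observed in this notebook.

```r
set.seed(42)

# Toy data (assumption): three well-separated 2-d blobs standing in for embeddings.
centers <- rbind(c(0, 0), c(10, 10), c(-10, 10))
toy_vectors <- do.call(rbind, lapply(seq_len(nrow(centers)), function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))

# Total within-cluster sum of squares for each candidate k.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy_vectors, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# The "elbow" is the k after which adding clusters stops reducing the error much.
plot(k_values, wss, type = "b",
     xlab = "k", ylab = "Total within-cluster sum of squares")
```

On real embeddings the curve is typically much smoother, which is why choosing k involves some judgment.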
To find the optimal number of topic subclusters for each theme cluster, another elbow plot is generated with a curve for each theme cluster:
Each theme cluster follows a similar plot, again representing a smooth curve. This time there is no clear “elbow” point. A reasonable choice of k can be selected anywhere between 8 and 15. We will select cluster.k=8 for the topic subclusters:
cluster.k <- 8
## [1] "Subclustering cluster 1 ..."
## [1] "Subclustering cluster 2 ..."
## [1] "Subclustering cluster 3 ..."
## [1] "Subclustering cluster 4 ..."
## [1] "Subclustering cluster 5 ..."
## [1] "Subclustering cluster 6 ..."
## [1] "Subclustering cluster 7 ..."
## [1] "Subclustering cluster 8 ..."
## [1] "Plotting cluster 1 ..."
## [1] "Plotting cluster 2 ..."
## [1] "Plotting cluster 3 ..."
## [1] "Plotting cluster 4 ..."
## [1] "Plotting cluster 5 ..."
## [1] "Plotting cluster 6 ..."
## [1] "Plotting cluster 7 ..."
## [1] "Plotting cluster 8 ..."
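The two-level clustering that produces the output above can be sketched roughly as follows. This is a simplified illustration on random toy data, with smaller values of `k` and `cluster.k` than those chosen in this notebook; the full implementation is in the accompanying .Rmd file.

```r
set.seed(1)

# Toy stand-in for the embedding matrix (assumption: one row per tweet).
toy_vectors <- matrix(rnorm(200 * 4), ncol = 4)

k <- 2          # number of theme clusters in this toy example
cluster.k <- 3  # number of topic subclusters per theme

# First pass: assign each row to a theme cluster.
themes <- kmeans(toy_vectors, centers = k, nstart = 5)$cluster

# Second pass: run k-means again inside each theme cluster.
topics <- integer(nrow(toy_vectors))
for (i in seq_len(k)) {
  in_theme <- which(themes == i)
  topics[in_theme] <- kmeans(toy_vectors[in_theme, , drop = FALSE],
                             centers = cluster.k, nstart = 5)$cluster
}

# Each row now carries a (theme, topic) pair.
print(table(theme = themes, topic = topics))
```

Topic numbers are only meaningful within their own theme: topic 1 of theme 1 is unrelated to topic 1 of theme 2.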
Inspection of the master plot shows that the main themes being discussed with respect to higher education at this point in mid-March are:
The nearest neighbors to these theme cluster centers display some of the discourse central to these issues:
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 3103 | 0.7772874 | @msleeplessagain i heard the news say that schools will be open on friday and from then on they’ll be closed until further notice | she |
| 9640 | 0.7766419 | @Mr_AlmondED Most schools are going to have to remain open. | |
| 3089 | 0.7719161 | @Jackstarbright @Miss_Snuffy Schools will still be open of course. Just not for most pupils. | London |
| 5306 | 0.7688661 | @slutdropstuart they’re closing all the schools as of friday apparently so looks like it’s all happening now | cyndee |
| 7728 | 0.7674363 | @skwawkbox They’ve just announced schools closed as of Friday til further notice, exams cancelled. | Scotland, United Kingdom |
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 4850 | 0.7388429 | well…my university has officially cancelled all in-person classes through the rest of the semester and this is the first time in my life I have cried over NOT being able to go to school | h-town |
| 9273 | 0.7306517 | I just found out classes are cancelled til April 17th, guess who’s not gonna get to graduate at the end of the semester | Michigan, USA |
| 114 | 0.7192284 | So now I don’t have any college anymore until they open schools and stuff again funnnnn | |
| 8729 | 0.7146884 | I can’t even be bothered w school rn. They need to just throw the whole semester away . Over it | |
| 1140 | 0.7103943 | I can’t believe college is cancelled this semester | My room |
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 734 | 0.7857939 | Given the impending disaster here in the UK (the government is shit, as usual), I think the Uni should cancel the exams or at the very least delay them - our students will have enough to deal with, as we all will. @UniStrathclyde | University of Strathclyde |
| 6834 | 0.7851976 | The PM just confirmed that exams are cancelled but with no clarification of how GCSEs and A-levels will be awarded. Young people across the country now facing so much uncertainty, with university places relying on these grades. | ldn // lancs |
| 4658 | 0.7844182 | This is ridiculous! DO NOT POSTPONE THE EXAMS. It is entirely unfair on students everywhere! Find another way to make sure we get the grades we need, as suggested In this petition #examscancelled #Exams #GCSEs #ALevels #coronavirus #CoronaVirusUpdate | UK |
| 3191 | 0.7784155 | Instead of cancelling GCSEs and Alevels, they should’ve postponed them or something. This won’t be fair on current students, who may outperform their predicted, nor on previous students | uk |
| 8760 | 0.7748671 | Everyone in GCSE landing with exams being cancelled; then us ALEVEL students are bricking it!! We need certain qualifications and grades for uni but not guaranteed them… shitting myself more now than I was for the exams. | Maesteg, Wales |
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 1331 | 0.7617859 | @SenWarren This is your pipe dream! Cancelling student loan debt means the tax payers foot the bill, many of which paid their own way already. It is not right to stick the bill that these students signed on to the tax payers. Tell these universities to teach for free! | 452020 |
| 533 | 0.7601979 | Even if i was gone graduate with $0 in student loans, i STILL would want them shits cancelled because college should be free for everybody anyway. | EAZ6 |
| 757 | 0.7561739 | More than 45 million Americans struggle with $1.6 trillion in student debt. We must cancel all student loan payments for the duration of this emergency. Long-term, we must cancel all student debt and make public colleges, universities, and trade schools tuition- and debt-free. | Vermont |
| 612 | 0.7459218 | @alyssaruder @SanfratelloLexi Waiving student loans is just part of a bigger idea: college being accessible for everybody. Be empathetic towards people who want to get education beyond high school but can’t because of how expensive it is. Should they be stuck working undesirable, low-paying jobs? | Crete, IL |
| 858 | 0.7347559 | “We need to waive all student loan payments for the duration of the emergency. Long term we must cancel all student debt and make public colleges, universities, and trade schools tuition free.” @BernieSanders | United States |
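Tables like the ones above can be produced by scoring each cluster member against its cluster center and keeping the highest-scoring rows. The sketch below assumes the center is the mean of the member vectors, as in k-means, and uses made-up two-dimensional vectors; the variable names are illustrative only.

```r
# Toy cluster of embedding vectors (assumption: made-up 2-d values, one row per tweet).
cluster_vecs <- rbind(
  t1 = c(1.0, 0.0),
  t2 = c(0.9, 0.1),
  t3 = c(0.0, 1.0),
  t4 = c(0.8, 0.3)
)

# The cluster center is the mean of its member vectors, as in k-means.
center <- colMeans(cluster_vecs)

# Cosine similarity of each member to the center.
center_cosine_similarity <- as.vector(cluster_vecs %*% center) /
  (sqrt(rowSums(cluster_vecs^2)) * sqrt(sum(center^2)))
names(center_cosine_similarity) <- rownames(cluster_vecs)

# Keep the top-n members closest to the center.
top_n <- 2
nearest <- head(sort(center_cosine_similarity, decreasing = TRUE), top_n)
print(round(nearest, 4))
```

The outlier `t3` pulls the center away from the other three vectors, so the member that lies between the two groups (`t4`) ends up nearest to the center.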
Within each theme, we can dig into specific topics by inspecting topic subclusters. An exhaustive review of the topic subclusters is beyond the scope of this report; however, some of the more interesting ones are shown below.
Within theme: General discussion (covid19 / universities / coronavirus)
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 3598 | 0.7357125 | @amandalorenz18 I seriously believe that this years team is on the cusp of greatness and poised to make a deep run in the WCWS. So saddened that the season has been canceled, but the NCAA did the right thing giving another year of eligibility. | |
| 2118 | 0.7208072 | In response COVID-19, the #NCAA has suspended all in-person recruiting effective immediately though April 15. Use this time to ensure you are academically eligible to play in college. Visit to sign-up for your College Athletic Report on Eligibility (CARE). | Chicago, IL |
| 1700 | 0.7134298 | NCAA D1, D2 and NAIA schools check out @SinclairMBB @jhendricks_99 2019-20 Highlights. 77 3pm. 40.5% from 3pt. Set school record with 11 3pm in a single game. Academic All-American Candidate. RS Freshman with 3 years left. | Dayton, Ohio |
| 3276 | 0.7134011 | NCAA needs to re-look this and reschedule if they can. So many student athletes at so many schools are being robbed of opportunities. C’mon @NCAA, get it together on this. | Worldwide |
| 2565 | 0.7077024 | @calliebell857 @tssaa High School seniors will not get an extra year of eligibility like they have given the College players and making it to State for most teams is once in a life time(not all teams but most) opportunity. We have a Senior and he has played his heart out to make it to State. | |
Within theme: General discussion (covid19 / universities / coronavirus)
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 1034 | 0.7103367 | BROOKLANDS COLLEGE CAMPUSES ARE CLOSED We are moving to online learning and students will continue their study from home. Please visit our website for further information: STUDENTS - Access Your Google Classroom Here: | Surrey |
| 8391 | 0.7101338 | We are shifting classes online, offering student supports via distance and restricting access to campuses. Check the YC website for daily updates on what we are doing to address COVID-19 as well as helpful resources: #COVID19 #TeamYukon #YukonU | Whitehorse, Yukon |
| 1046 | 0.7014487 | Students: Are you looking for someone to answer your #COVID19 questions? Here are points of contact for public, 4-year institutions. As this situation rapidly evolves, stay informed! More information available on our #coronavirus resource page: | Trenton, NJ |
| 995 | 0.6979433 | Update 20th March We have updated our website to include the latest information for students and parents. Please visit this page frequently as this will be the single source of information outlining the College’s response to coronavirus (COVID-19). | Wales, United Kingdom |
| 3044 | 0.6920693 | Hey Cougars - Please find updated information concerning your classes going forward, limited campus hours, information about student housing, and resources available to you at this time online at For questions or concerns, email covid19@wncc.edu. | Scottsbluff, NE, United States |
Within theme: Transition to online classes (semester / classes / online)
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 1194 | 0.7584000 | school sucks until its cancelled and you realize you will miss ur classmates and the kids you saw every week for your field study | |
| 5582 | 0.7438353 | i wont have school until fucking april and who knows if ill even have school until i graduate or if im even graduating irl i hate it here and i actually really enjoyed my second semester classes and im gonna miss seeing my friends so bad but most of them arent allowed to go out | oty 0112 |
| 9456 | 0.7425337 | Why we gotta leave school when I’m finally actually enjoying my classes man?! this ain’t it | |
| 6781 | 0.7392802 | School being cancelled is cool and all but there’s this feeling of impending doom that we’ll be back at some point and teachers finna be like you have 8 months to study and you didn’t? Smh | She/they |
| 9115 | 0.7268790 | I hate school ,took off and decided to go back this semester and this happens, school really ain’t for me gods plan | Bensonhurst, Brooklyn |
Within theme: Transition to online classes (semester / classes / online)
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 1696 | 0.7993825 | It’s so sad that college commencements are being canceled. I feel especially sad for the first-generation graduates. I never stepped foot in a college graduation until my own. For those of you that have dreamed of that day, I see you. Your accomplishments still matter! | under da sea |
| 6319 | 0.7941301 | Sad to see seniors getting graduation canceled. A lot of you will get your degree and go straight into the workforce. You’ll be the beginning to try and fix the problems that this world will face. Work hard out of college, so future loved ones experience what you could not. | |
| 442 | 0.7812365 | Since graduation is basically cancelled, we decided to celebrate ourselves and our hard work. Fuck covid19 and fuck universities for not finding better alternatives to help us students out, especially those who have worked so long and hard to finish this last semester. | |
| 7998 | 0.7698545 | The people telling me not to be sad about my canceled graduation has either one: fucking graduated or two: never been to college a day in their lives. Pls leave me tf alone and let me be upset about this loss. | UNCG |
| 8083 | 0.7622852 | cancelling these folks graduation is a push now they could of AT LEAST postponed it…we work too damn hard | |
Within theme: Impact on student finances (student / loans / debt)
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 1177 | 0.7811172 | Colleges should give refunds to every student affected by the coronavirus shutdowns. If you’re gonna kick them off campus, give them their money back. The taxpayer should not be expected to subsidize any of it. In before school administrators demand taxpayer dollars. | |
| 118 | 0.7750281 | Wow, these money hungry Universities are taking advantage of the coronavirus. They won’t give a refund on dorms, and they charge the same amount of money for online classes as if the students were attending live classes. Why do we need a campus to get a college education? | Southern California |
| 1363 | 0.7680780 | Y’all steady telling people to stop worrying about getting a housing refund, but most of y’all saying this school paid for.Y’all not coming out of pocket thousands of dollars to stay on campus. So let these people do wtf they want. In the end they’ll either get a refund or a no. | McComb, MS, United States |
| 409 | 0.7497391 | It’s truly incredible that such an expensive, famous school (which encouraged its students living on campus to go home) will not even consider partial refunds to tuition or housing when the educations we are going +$100k in debt for are essentially nonexistent right now. | Anaheim, CA |
| 183 | 0.7440381 | Colleges owe their students better answers than this. They paid for services they are not getting. Refunds should be issued. If that’s not feasible, give them credit to be used when school reopens. | Washington, DC |
Within theme: Impact on student finances (student / loans / debt)
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 757 | 0.8291112 | More than 45 million Americans struggle with $1.6 trillion in student debt. We must cancel all student loan payments for the duration of this emergency. Long-term, we must cancel all student debt and make public colleges, universities, and trade schools tuition- and debt-free. | Vermont |
| 1206 | 0.8240200 | Student loan debt should be cancelled because the loans were predatory (AND ARE. I think we forget this beast is still going and preying on a new generation). And when Banks got bailed out, all debt should’ve been wiped out that they facilitated/and the program stopped | The Firmament of Heaven |
| 2749 | 0.8085207 | Abolish student loans | Chicago, Illinois |
| 858 | 0.7959975 | “We need to waive all student loan payments for the duration of the emergency. Long term we must cancel all student debt and make public colleges, universities, and trade schools tuition free.” @BernieSanders | United States |
| 5832 | 0.7956836 | Student debt cancellation is critical to economic relief…regardless of #COVID19 Economist have predicted a student loans crisis was coming that would parallel the mortgage crisis. It’s time to secure America’s future. Let’s see if Congress can do right by us. | New Jersey, USA |
By examining a three-day window in mid-March 2020, we are able to get a sense of some of the most important issues facing colleges, universities, and students at the onset of the COVID-19 pandemic.
The technique used here is generalizable to any social media dataset. However, it is important to discuss some of the ways in which the analysis presented here can be extended, as well as some of the technical limitations of the system.
The logical next step is to run the same analysis on several three-day windows after the initial onset of the pandemic - one in mid-April and one in early May. The shift in orientation of the high-level theme clusters should show the evolution of the discussion of the impact of the pandemic on higher education. For example, we should expect discussion on school closures to be less focused on initial logistics and more focused on the uncertainty around fall re-opening.
Additionally, each cluster and subcluster can be augmented with the following information:

* An average sentiment score, as computed by a sentiment analysis classifier.
* A measure of how many tweets in the cluster or subcluster originate from official accounts.
* A breakdown of user locations. It is hard to tell by looking at a cluster whether the majority of the content pertains to the higher education systems in the US, UK, EU, or other parts of the world.
As previously mentioned, term frequency analysis is used to determine the dominant topics of discussion within a cluster and to generate the cluster label. The analysis is done across each whole cluster, which leaves open the possibility that the dominant terms do not lie close to the cluster center. This is prone to happen in large, diverse clusters. When it occurs, the top-k nearest neighbors to the cluster center will not contain discourse related to the cluster's label.
This can be mitigated by striking a good balance between diversity and homogeneity in the clusters and subclusters. However, this makes the labeling quality very sensitive to the choice of k for k-means.
Quote tweets contain two text sequences – the quoted text and the tweet text. We embed these tweets in two ways - once using a concatenation of the quoted and tweet text and once using the sum of their individual embeddings. However, the system does not use these combined embeddings yet. Instead, only the tweet text embedding is used in the clustering.
The result is what appears to be a “garbage” cluster such as “covid19 / president / gentlemen”, where almost all tweets are only several words in length:
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 1001 | 0.7106649 | Folks,,, | Washington, DC |
| 943 | 0.7070935 | Americans | Leviticus 18:22 |
| 944 | 0.7070935 | americans | xoxo gossip girl |
| 946 | 0.7070935 | Americans | Little Town in South Yorkshire |
| 945 | 0.7070935 | Americans…. …… .. | Bikini bottom |
To fix this, the full quoted portion of the tweet must be included, and one version of the combined embedding must be used in the clustering. How to embed quote tweets effectively is an open area of investigation, as tracked in the GitHub repo.
[1] Sarker A, Gonzalez-Hernandez G, Ruan Y, Perrone J. Machine Learning and Natural Language Processing for Geolocation-Centric Monitoring and Characterization of Opioid-Related Social Media Chatter. JAMA Netw Open. 2019;2(11):e1914672. doi:10.1001/jamanetworkopen.2019.14672
[2] “Consuming Streaming Data.” Twitter, Twitter, developer.twitter.com/en/docs/tutorials/consuming-streaming-data.
[3] “Elasticsearch: The Official Distributed Search & Analytics Engine.” Elastic, www.elastic.co/elasticsearch/.
[4] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil. Universal Sentence Encoder. arXiv:1803.11175, 2018.
[5] Abadi, Martín, et al. “TensorFlow: A system for large-scale machine learning.” 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016.
[6] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.
[7] “Stats.” Function | R Documentation, www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kmeans.
[8] Maaten, Laurens van der, and Geoffrey Hinton. “Visualizing data using t-SNE.” Journal of machine learning research 9.Nov (2008): 2579-2605.